Vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability on a wide range of visual tasks. Transferring knowledge from such powerful pre-trained VLMs is emerging as a promising direction for building effective video recognition models. However, current exploration of this direction remains limited. In our view, the greatest strength of pre-trained vision-language models is that they build a bridge between the visual and textual domains. In this paper, we present a novel framework called BIKE, which utilizes this cross-modal bridge to explore bidirectional knowledge: i) We propose a Video Attribute Association mechanism that leverages Video-to-Text knowledge to generate textual auxiliary attributes that complement video recognition. ii) We also present a Temporal Concept Spotting mechanism that uses Text-to-Video expertise to capture temporal saliency in a parameter-free manner, yielding enhanced video representations. Extensive studies on popular video datasets (i.e., Kinetics-400 & 600, UCF-101, HMDB-51, and ActivityNet) show that our method achieves state-of-the-art performance in most recognition scenarios, e.g., general, zero-shot, and few-shot video recognition. To the best of our knowledge, our best model achieves a state-of-the-art accuracy of 88.4% on the challenging Kinetics-400 with the released CLIP pre-trained model.
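As an illustration of the parameter-free temporal-saliency idea, here is a minimal PyTorch sketch that pools frame features by their similarity to a category's text embedding. The function name, shapes, and temperature are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def temporal_saliency_pooling(frame_feats, text_feat, temperature=0.01):
    """Weight frames by their similarity to a class-name text embedding.

    frame_feats: (T, D) per-frame visual embeddings (e.g., from a CLIP-style image encoder)
    text_feat:   (D,)   text embedding of the category name
    Returns a single (D,) video embedding pooled by text-conditioned saliency.
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    sims = frame_feats @ text_feat                  # (T,) frame-to-text similarity
    weights = torch.softmax(sims / temperature, 0)  # parameter-free saliency weights
    return (weights.unsqueeze(-1) * frame_feats).sum(0)

# toy usage: 8 frames, 512-dim CLIP-like embeddings
video_emb = temporal_saliency_pooling(torch.randn(8, 512), torch.randn(512))
```

Because the weights come from a softmax over fixed similarities, no additional temporal parameters need to be learned.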
Cross-view geo-localization aims to estimate the location of a query ground image by matching it against a reference database of geo-tagged aerial images. It is an extremely challenging task whose difficulty is rooted in the drastic view change and the different capture times of the two views. Despite these difficulties, recent works have achieved outstanding progress on cross-view geo-localization benchmarks. However, existing methods still perform poorly on cross-area benchmarks, in which the training and testing data are captured from two different regions. We attribute this deficiency to the inability to extract the spatial configuration of visual feature layouts and to the models' overfitting to low-level details in the training set. In this paper, we propose GeoDTR, which explicitly disentangles geometric information from raw features and learns the spatial correlations among visual features from aerial and ground pairs with a novel geometric layout extractor module. This module generates a set of geometric layout descriptors that modulate the raw features and produce high-quality latent representations. In addition, we elaborate two categories of data augmentation: (i) layout simulation, which varies the spatial configuration while keeping the low-level details intact, and (ii) semantic augmentation, which alters the low-level details and encourages the model to capture spatial configurations. These augmentations improve the performance of cross-view geo-localization models, especially on cross-area benchmarks. Moreover, we propose a counterfactual-based learning process to help the geometric layout extractor explore spatial information. Extensive experiments show that GeoDTR not only achieves state-of-the-art results but also significantly boosts performance on both same-area and cross-area benchmarks.
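To make the "layout descriptors modulate raw features" idea concrete, below is a minimal PyTorch sketch in which a small set of learned queries attend over a flattened feature map to form layout descriptors that then gate the features. All sizes, the attention-based extractor, and the sigmoid gating are assumptions for illustration, not the GeoDTR module itself.

```python
import torch
import torch.nn as nn

class LayoutExtractor(nn.Module):
    """Toy geometric-layout extractor: N learned queries attend over a CNN
    feature map to produce spatial layout descriptors, which then gate
    (modulate) the raw features."""
    def __init__(self, dim=256, num_descriptors=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_descriptors, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.to_gate = nn.Linear(dim, dim)

    def forward(self, feats):                    # feats: (B, HW, dim)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        desc, _ = self.attn(q, feats, feats)     # (B, N, dim) layout descriptors
        gate = torch.sigmoid(self.to_gate(desc.mean(1, keepdim=True)))
        return feats * gate                      # modulated latent representation

x = torch.randn(2, 64, 256)                      # e.g., an 8x8 map flattened
out = LayoutExtractor()(x)
```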
Sign language is a window for people of different abilities to express their feelings and emotions. However, it remains challenging for people to learn sign language in a short period of time. To address this real-world challenge, in this work we study a motion transfer system that can transfer a user photo to a sign language video of a specific word. In particular, the appearance content of the output video comes from the provided user image, while the motion of the video is extracted from a specified tutorial video. We observe two primary limitations in adopting state-of-the-art motion transfer methods to generate sign language: (1) existing motion transfer works ignore the prior geometric knowledge of the human body; (2) previous image animation methods only take image pairs as input in the training stage, which fails to fully exploit the temporal information within videos. To address the above limitations, we propose the Structure-aware Temporal Consistency Network (STCNet) to jointly optimize the prior human structure and the temporal consistency for sign language video generation. This paper makes two main contributions. (1) We utilize a fine-grained skeleton detector to provide prior knowledge of human body keypoints; in this way, we ensure that keypoint motions stay within a valid range and make the model more interpretable and robust. (2) We introduce two cycle-consistency losses, a short-term cycle loss and a long-term cycle loss, to ensure the continuity of the generated video. The two losses and the keypoint detector network are optimized in an end-to-end manner.
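The two cycle-consistency losses can be sketched as follows: animating a frame forward by a short or long temporal gap and then back should reconstruct the original frame. The generator signature, the L1 reconstruction loss, and the gap values below are assumptions for illustration, not the STCNet losses themselves.

```python
import torch
import torch.nn.functional as F

def cycle_losses(generator, frames, keypoints, short_gap=1, long_gap=8):
    """Toy short-/long-term cycle-consistency losses for frame animation.

    generator(src_frame, kps) -> frame animated to match keypoints `kps`.
    Animating frame t to t+gap and back to t should reproduce frame t.
    """
    def cycle(gap):
        t = 0
        fwd = generator(frames[t], keypoints[t + gap])   # t -> t+gap
        back = generator(fwd, keypoints[t])              # t+gap -> t
        return F.l1_loss(back, frames[t])
    return cycle(short_gap), cycle(long_gap)

# toy usage with a dummy generator that ignores keypoints
g = lambda img, kps: img
frames, kps = torch.randn(10, 3, 64, 64), torch.randn(10, 17, 2)
short_l, long_l = cycle_losses(g, frames, kps)
```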
Multivariate time series forecasting has a wide range of applications in various domains, including finance, traffic, energy, and healthcare. To capture complex temporal patterns, a large body of research has designed sophisticated neural network architectures based on many variants of RNNs, GNNs, and Transformers. However, sophisticated models are often computationally expensive and therefore face severe challenges in training and inference efficiency when applied to large real-world datasets. In this paper, we present LightTS, a lightweight deep learning architecture based merely on simple MLP-based structures. The key idea of LightTS is to apply MLP-based structures on top of two delicate downsampling strategies, interval sampling and continuous sampling, inspired by the crucial fact that a downsampled time series usually preserves most of its information. We conduct extensive experiments on eight widely used benchmark datasets. Compared with existing state-of-the-art methods, LightTS achieves better performance on five of them and comparable performance on the rest. Moreover, LightTS is highly efficient: it uses less than 5% of the FLOPs of the previous SOTA method on the largest benchmark dataset. In addition, the forecasting accuracy of LightTS exhibits much smaller variance than that of previous SOTA methods on long-sequence forecasting tasks.
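The two downsampling strategies are simple enough to sketch directly: interval sampling splits the series into k interleaved subsequences (every k-th point), while continuous sampling cuts it into k contiguous chunks. The (batch, time, channels) shape convention is an assumption for illustration.

```python
import torch

def interval_sample(x, k):
    """Split a series into k interleaved subsequences: subsequence j takes
    points j, j+k, j+2k, ...  x: (B, T, C) -> (B, k, T//k, C)."""
    B, T, C = x.shape
    return x[:, : T - T % k].reshape(B, T // k, k, C).transpose(1, 2)

def continuous_sample(x, k):
    """Split a series into k contiguous chunks. x: (B, T, C) -> (B, k, T//k, C)."""
    B, T, C = x.shape
    return x[:, : T - T % k].reshape(B, k, T // k, C)

x = torch.arange(16.).reshape(1, 16, 1)
print(interval_sample(x, 4).squeeze(-1))    # rows: [0,4,8,12], [1,5,9,13], ...
print(continuous_sample(x, 4).squeeze(-1))  # rows: [0,1,2,3], [4,5,6,7], ...
```

An MLP applied to each subsequence then sees either the coarse global shape (interval sampling) or local detail (continuous sampling) at a fraction of the original sequence length.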
In this report, we present the ReLER@ZJU-Alibaba submission to the Ego4D Natural Language Queries (NLQ) Challenge at CVPR 2022. Given a video clip and a text query, the goal of this challenge is to locate the temporal moment of the video clip where the answer to the query can be obtained. To tackle this task, we propose a multi-scale cross-modal transformer and a video-frame-level contrastive loss to fully uncover the correlation between the language query and video clips. In addition, we propose two data augmentation strategies to increase the diversity of training samples. The experimental results demonstrate the effectiveness of our method. The final submission ranked first on the leaderboard.
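A plausible form of the video-frame-level contrastive loss is an InfoNCE objective over frames that pulls frames inside the annotated moment toward the query embedding; the shapes, temperature, and exact formulation below are assumptions for illustration, not the submission's loss.

```python
import torch
import torch.nn.functional as F

def frame_level_contrastive_loss(frame_feats, query_feat, inside_mask, tau=0.07):
    """Toy frame-level contrastive loss for moment localization: frames
    inside the ground-truth moment are positives, the rest are negatives.

    frame_feats: (T, D), query_feat: (D,), inside_mask: (T,) bool
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    query_feat = F.normalize(query_feat, dim=-1)
    logits = frame_feats @ query_feat / tau      # (T,) frame-query similarities
    log_probs = F.log_softmax(logits, dim=0)
    return -log_probs[inside_mask].mean()        # maximize mass on positive frames

# toy usage: 32 frames, ground-truth moment spanning frames 8..13
mask = torch.zeros(32, dtype=torch.bool).index_fill_(0, torch.arange(8, 14), True)
loss = frame_level_contrastive_loss(torch.randn(32, 256), torch.randn(256), mask)
```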
This paper investigates how to achieve better and more efficient embedding learning for semi-supervised video object segmentation under challenging multi-object scenarios. State-of-the-art methods learn to decode features with a single positive object and thus have to match and segment each target separately under multi-object scenarios, consuming computing resources multiple times over. To solve the problem, we propose an Associating Objects with Transformers (AOT) approach to match and decode multiple objects jointly and collaboratively. In detail, AOT employs an identification mechanism to associate multiple targets into the same high-dimensional embedding space. Thus, we can simultaneously process the matching and segmentation decoding of multiple objects as efficiently as processing a single object. To sufficiently model multi-object associations, a Long Short-Term Transformer (LSTT) is designed to construct hierarchical matching and propagation. Based on AOT, we further propose a more flexible and robust framework, Associating Objects with Scalable Transformers (AOST), in which a scalable version of LSTT is designed to achieve run-time adaptation for accuracy-efficiency trade-offs. In addition, AOST introduces a better layer-wise manner to couple the identification and vision embeddings. We conduct extensive experiments on multi-object and single-object benchmarks to examine the AOT series of frameworks. Compared with state-of-the-art competitors, our methods maintain several times better run-time efficiency with superior performance. Notably, we achieve new state-of-the-art performance on three popular benchmarks: YouTube-VOS (86.5%), DAVIS 2017 Val/Test (87.0%/84.7%), and DAVIS 2016 (93.0%). Project page: https://github.com/z-x-yang/aot.
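The identification mechanism can be sketched as a bank of identity embeddings indexed by per-pixel object IDs, so that all targets live in one shared embedding map and are matched and decoded in a single pass. The sizes and the lookup formulation below are assumptions for illustration, not the AOT implementation.

```python
import torch
import torch.nn as nn

class IDAssignment(nn.Module):
    """Toy identification mechanism: embed up to `max_obj` object masks into
    one shared feature map so that all targets can be matched and decoded
    jointly, as if they were a single object."""
    def __init__(self, max_obj=10, dim=256):
        super().__init__()
        self.id_bank = nn.Parameter(torch.randn(max_obj + 1, dim))  # +1 for background

    def forward(self, masks):        # masks: (B, H, W) integer object IDs in [0, max_obj]
        return self.id_bank[masks]   # (B, H, W, dim) identity embedding map

masks = torch.randint(0, 4, (2, 30, 30))   # two frames, three objects + background
id_map = IDAssignment()(masks)             # all objects embedded in one shared space
```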
Although transfer learning is promising for increasing learning efficiency, existing methods are still subject to the challenges of long-horizon tasks, especially when expert policies are sub-optimal and only partially useful. Hence, a novel algorithm named EASpace (Enhanced Action Space) is proposed in this paper to transfer the knowledge of multiple sub-optimal expert policies. EASpace formulates each expert policy into multiple macro actions with different execution time periods, then integrates all macro actions directly into the primitive action space. Through this formulation, EASpace can learn when to execute which expert policy and how long it should last. An intra-macro-action learning rule is proposed that adjusts the temporal-difference target of macro actions to improve data efficiency and alleviate the non-stationarity issue in multi-agent settings. Furthermore, an additional reward proportional to the execution time of macro actions is introduced to encourage environment exploration via macro actions, which is significant for learning long-horizon tasks. A theoretical analysis is presented to show the convergence of the proposed algorithm. The efficiency of the proposed algorithm is illustrated by a grid-based game and a multi-agent pursuit problem, and the algorithm is also implemented on real physical systems to justify its effectiveness.
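The enhanced action space can be sketched as the union of primitive actions and (expert policy, duration) macro actions, where selecting a macro commits the agent to the expert for its duration. The environment API, the experts, and the durations below are assumptions for illustration, not the EASpace setup.

```python
# Toy enhanced action space: primitives plus (expert, duration) macro actions.
PRIMITIVES = ["up", "down", "left", "right"]
EXPERTS = {"chase": lambda s: "up", "spread": lambda s: "left"}
DURATIONS = [2, 4, 8]
MACROS = [(name, d) for name in EXPERTS for d in DURATIONS]
ACTIONS = PRIMITIVES + MACROS          # the agent selects over this joint space

class ToyEnv:
    """Stub environment with a (state, reward, done) step interface."""
    def step(self, state, action):
        return state, 0.0, False

def execute(env, state, action):
    """Run one primitive step, or roll an expert policy out for d steps."""
    if action in PRIMITIVES:
        return env.step(state, action)
    name, d = action
    total_reward = 0.0
    for _ in range(d):                 # the macro commits to the expert for d steps
        state, r, done = env.step(state, EXPERTS[name](state))
        total_reward += r
        if done:
            break
    return state, total_reward, done

s, r, done = execute(ToyEnv(), 0, ("chase", 4))
```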
Understanding objects is a central building block of artificial intelligence, especially for embodied AI. Even though object recognition excels with deep learning, current machines still struggle to learn higher-level knowledge, e.g., what attributes an object has, and what we can do with an object. In this work, we propose a challenging Object Concept Learning (OCL) task to push the envelope of object understanding. It requires machines to reason out object affordances and simultaneously give the reason: which attributes make an object possess these affordances. To support OCL, we build a densely annotated knowledge base including extensive labels for three levels of object concepts (category, attribute, affordance) and the causal relations among the three levels. By analyzing the causal structure of OCL, we present a baseline, the Object Concept Reasoning Network (OCRN). It leverages causal intervention and concept instantiation to infer the three levels following their causal relations. In experiments, OCRN effectively infers object knowledge while following the causalities well. Our data and code are available at https://mvig-rhos.com/ocl.
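One way to picture reasoning that follows the causal order category -> attribute -> affordance is a toy network in which each level is predicted conditioned on its causal parents; the layer sizes and conditioning scheme below are assumptions for illustration, not the OCRN architecture.

```python
import torch
import torch.nn as nn

class TinyConceptReasoner(nn.Module):
    """Toy three-level reasoner following the causal order
    category -> attributes -> affordances: each level's prediction
    is conditioned on the levels that causally precede it."""
    def __init__(self, feat_dim=512, n_cat=80, n_attr=100, n_aff=50):
        super().__init__()
        self.cat_head = nn.Linear(feat_dim, n_cat)
        self.attr_head = nn.Linear(feat_dim + n_cat, n_attr)
        self.aff_head = nn.Linear(feat_dim + n_cat + n_attr, n_aff)

    def forward(self, feat):
        cat = self.cat_head(feat).softmax(-1)
        attr = torch.sigmoid(self.attr_head(torch.cat([feat, cat], -1)))
        aff = torch.sigmoid(self.aff_head(torch.cat([feat, cat, attr], -1)))
        return cat, attr, aff        # affordances are explained by attributes

cat, attr, aff = TinyConceptReasoner()(torch.randn(4, 512))
```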
Most existing methods in visual retrieval either match two modalities by comparing their global feature vectors, which misses fine-grained information and lacks interpretability; detect objects in images or videos and align the text with fine-grained elements, which relies on complicated model designs; or model fine-grained interaction via cross-attention between visual and textual tokens, which suffers from low efficiency. To address these limitations, some recent works simply aggregate token-wise similarities to achieve fine-grained alignment, but they lack intuitive explanations and ignore the relationship between token-level features and global representations carrying high-level semantics. In this work, we rethink fine-grained cross-modal alignment and devise a new model-agnostic formulation for it. We also demystify the recent popular works and subsume them into our scheme. Furthermore, inspired by optimal transport theory, we introduce TokenFlow, an instantiation of the proposed scheme. By modifying only the similarity function, the performance of our method is comparable to that of SoTA algorithms with heavy model designs on major video-text retrieval benchmarks. Visualizations further show that TokenFlow successfully leverages fine-grained information and achieves better interpretability.
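To illustrate how changing only the similarity function yields fine-grained alignment, here is a toy token-level matcher that scores a video-text pair by matching each text token to its best video token and averaging. This simple max-aggregation is a stand-in under stated assumptions, not the optimal-transport-based TokenFlow formulation.

```python
import torch
import torch.nn.functional as F

def token_level_similarity(video_tokens, text_tokens):
    """Toy fine-grained similarity: compared with global-vector retrieval,
    only the similarity function changes. Each text token is matched to its
    best video token and the matches are averaged.

    video_tokens: (Nv, D), text_tokens: (Nt, D)
    """
    v = F.normalize(video_tokens, dim=-1)
    t = F.normalize(text_tokens, dim=-1)
    sim = t @ v.T                       # (Nt, Nv) token-to-token similarities
    return sim.max(dim=1).values.mean() # best video match per text token

score = token_level_similarity(torch.randn(20, 512), torch.randn(7, 512))
```

The per-token similarity matrix is also what makes such methods interpretable: it shows which video token each word aligned to.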
Nowadays, self-paced learning (SPL) is an important machine learning paradigm that mimics the cognitive processes of humans and animals. The SPL regime involves a self-paced regularizer and a gradually increasing age parameter, which plays a key role in SPL, yet where to optimally terminate this process remains non-trivial. A natural idea is to compute the solution path w.r.t. the age parameter (i.e., the age-path). However, current age-path algorithms are either limited to the simplest regularizer or lack a solid theoretical understanding as well as computational efficiency. To address this challenge, we propose a novel Generalized Age-path Algorithm (GAGA) for SPL with various self-paced regularizers, based on ordinary differential equations (ODEs) and sets control, which can learn the entire solution spectrum w.r.t. a range of age parameters. To the best of our knowledge, GAGA is the first exact path-following algorithm tackling the age-path for general self-paced regularizers. Finally, the algorithmic steps for the classic SVM and Lasso are described in detail. We demonstrate the performance of GAGA on real-world datasets and find considerable speedups between our algorithm and competing baselines.
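The role of the age parameter can be illustrated with the simplest (hard) self-paced regularizer, where a sample is admitted iff its loss falls below lambda; sweeping lambda traces a coarse version of the age-path. The losses and the lambda grid below are assumptions for illustration; GAGA itself follows the exact path rather than a grid.

```python
import numpy as np

def spl_hard_weights(losses, lam):
    """Hard self-paced regularizer: admit a sample iff its loss < lambda."""
    return (losses < lam).astype(float)

# Toy "age-path": sweep the age parameter and watch the admitted set grow
# from easy (low-loss) samples toward the whole training set.
rng = np.random.default_rng(0)
losses = rng.exponential(scale=1.0, size=100)
for lam in [0.2, 0.5, 1.0, 2.0]:
    v = spl_hard_weights(losses, lam)
    print(f"lambda={lam:.1f}: {int(v.sum())} samples admitted")
```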